Data Collection

Data is all around us: think about weather forecasts, traffic data on Google Maps, statistics of your favourite sports teams, or even the number of likes on an Instagram post. Every day, we use this data to make decisions. We use weather data to decide what to wear; traffic data to decide which route to take to university; sports statistics to decide if our team has a chance to win their next match; social media data to decide if an influencer is popular or not; and much more.

Figure 1: Sources of data used for everyday decisions. Image attribution: Weather reporter from https://www.pngwing.com/; Traffic data from Google Maps; Rugby data from ESPN via Wikipedia; Influencer from https://www.vecteezy.com/free-vector/woman

Any statistical model or analysis is only as good as the input data. No matter how sophisticated the analysis, if the input data is poor, the output results will not be any good. This is known as the Garbage-In-Garbage-Out (GIGO) principle.

Figure 2: The GIGO principle

As good data analysts, it is our responsibility to ensure that, whenever we are involved in data collection, it is done correctly, in a systematic, accurate, and unbiased way. The goal of statistical data collection is to gather information that can be used to find patterns and trends, accurately answer questions, test hypotheses, and make evidence-based decisions.

The process of data collection is multi-faceted. It involves identifying what we want to study, choosing the right methods to collect the data (like surveys, experiments, observations, or finding secondary or tertiary data sources), and ensuring the information is reliable and representative. By learning about data collection, you will gain tools to make informed decisions in a variety of fields, from science and business to social issues. It will also equip you, as a data analyst, to help others to do so.

Q: Can you think of other sources of data that you use to make daily decisions? How reliable do you think these sources are?

Planning data collection

Before you start collecting data, it is crucial that you plan how you are going to do it. You will need to ask yourself (at least) the following questions:

  1. What is the question I am trying to answer?
  2. What/who is the population I am studying?
  3. What kind of information do I need from the population to answer this question?
  4. How can I obtain a dataset that will be representative of the information I need from this population?
  5. How can I obtain such a dataset in an ethical way?

Example 1: Raheem wants to open a new restaurant on Hatfield campus. Before doing this, he wants to know what the students think of the restaurants and fast food places already available on campus, in terms of the cost, freshness, variety and taste of food. Following the data collection planning questions above, he gives the following answers:

  1. What is the question I am trying to answer? What are the perceptions of students on Hatfield campus about the cost, freshness, variety and taste of food already available on campus?
  2. What/who is the population I am studying? Students on Hatfield campus.
  3. What kind of information do I need from the population to answer this question? Answers from students about their perceptions of the available food options.
  4. How can I obtain a dataset that will be representative of the information I need from this population? I can conduct a survey of students on Hatfield campus.
  5. How can I obtain such a dataset in an ethical way? By obtaining the relevant permissions to conduct such a survey from the University, and by obtaining clear consent from every student I ask, after I have explained the purpose of the survey. I will also keep the students’ answers anonymous.

Example 2: Tebogo is an analyst for a private security firm operating in the Hatfield City Improvement District (CID). Her security firm has recently employed a new strategy to combat theft. She wants to know whether thefts have decreased since they employed the strategy. She answers the questions as follows:

  1. What is the question I am trying to answer? Have thefts in the Hatfield CID decreased since our security firm employed its new theft-prevention strategy?
  2. What/who is the population I am studying? Everyone working in, living in, or travelling through the Hatfield CID.
  3. What kind of information do I need from the population to answer this question? Theft statistics from all relevant police stations whose precincts overlap with the CID.
  4. How can I obtain a dataset that will be representative of the information I need from this population? I can request crime statistics from the relevant police stations.
  5. How can I obtain such a dataset in an ethical way? By obtaining the relevant permissions from the SAPS and from the relevant persons at the police stations themselves, and by signing any necessary agreements about my use of the data.

Example 3: William is an ecologist who wants to determine if a new pesticide-free anti-fungal treatment he has developed will keep maize safe from fungal infections. Below are his answers to the data collection planning questions:

  1. What is the question I am trying to answer? Whether my new anti-fungal treatment works to protect maize from fungal infections.
  2. What/who is the population I am studying? Maize plants.
  3. What kind of information do I need from the population to answer this question? Data on the health of maize plants that were given the treatment, and maize plants that were not given the treatment, when exposed to fungi.
  4. How can I obtain a dataset that will be representative of the information I need from this population? By planting two fields of maize, giving one the anti-fungal treatment and leaving the other without treatment, and then exposing them both to the fungus.
  5. How can I obtain such a dataset in an ethical way? By making sure the fungus cannot spread to any other plants or crops.

Class Exercise Question 1: You are asked to determine the favourite movie of first-year mathematical sciences students in your class. Answer the data collection planning questions at the beginning of this section to determine what your intended dataset is, and how you will collect it. Then, collect the data by asking some of the other students in the class what their favourite movie is.

Reminder of the data collection planning questions: 1. What is the question I am trying to answer? 2. What/who is the population I am studying? 3. What kind of information do I need from the population to answer this question? 4. How can I obtain a dataset that will be representative of the information I need from this population? 5. How can I obtain such a dataset in an ethical way?

Primary data collection

Primary data is collected first-hand by the researcher in order to answer a specific question or questions. Examples of primary data collection include conducting interviews and surveys to ask about people’s opinions and experiences; conducting experiments in a laboratory; collecting field data, such as animal tracking data; and taking direct measurements (e.g. the chemistry of plants, or the weight of animals).

Primary data is complex and resource-intensive to collect. It also requires an in-depth understanding of the answers to the data collection planning questions in the previous section. When you collect primary data, it is your responsibility to ensure that the correct data is collected in the correct way, and that the data is representative, unbiased, and ethical. We will learn more about representative and unbiased data in the section on evaluating data.

Exercise: For each of Examples 1-3 in the previous section, identify the type of primary data collected (survey, experiment, field data, or direct measurements), or indicate if it was not primary data.

Class Exercise Question 2: Is the dataset from Class Exercise Question 1 a primary, secondary or tertiary dataset? Can you obtain the data through surveys, interviews, experiments, field data, or direct measurements?

Secondary data collection

Secondary data is data that was collected by a different researcher for a purpose that is different from the current study. Examples of secondary data include data from the national census, marketing data collected by a company, crime data, social media data, and more.

Secondary data collection is usually done in one of the following ways:

  • By approaching the custodian of the primary data and obtaining their permission to use the data. This is usually done with sensitive data like public health or crime data.
  • By downloading an open dataset from the internet.
  • By web scraping or using other techniques to gather data from the internet.

Q: Can you think of other examples of secondary data, and how you would collect them?

The main challenge when collecting secondary data is to make sure that it is the correct data to answer your research question. Even though secondary data was collected by someone else, you as the analyst still need to ensure that the data is of good quality, and ethical. Although the original data was not collected by you, you are still responsible for the ethics of the data as it pertains to your study. If the data was collected unethically, you could still face consequences for using it. This means that you cannot assume that the data is relevant and ethical.

Exercise: For each of Examples 1-3, if the data collected was not primary, identify the type of secondary data collected.

Evaluating Data

Once you have collected data (whether it is primary or secondary), you need to be able to determine if the data is good and fit for use. This section explains the attributes of a good dataset, and how to check if it is relevant to the research question at hand. The key aspects of a good dataset include relevance, quality, representativeness, unbiasedness, and impact.

Relevance

Relevant data is data that is applicable to the research question at hand. The analyst must be able to use this data to answer their research question. It must also be up-to-date for the purpose of the study. It is important to ensure that data is relevant, since irrelevant or redundant data can clutter the analysis and reduce the efficiency of the study.

For example, if an insurer wants to answer a question about short-term insurance in 2024, a dataset on long-term insurance in 2024 would be irrelevant. Similarly, a dataset on short-term insurance in 1996 would be irrelevant.

In order to ensure that primary data is relevant, one should collect only necessary data and regularly review datasets for alignment with the research objectives. For secondary data, one should determine the scope of the data and date of collection.

Quality

The data must be of good quality. This includes completeness and consistency.

A dataset is complete if it has minimal missing or incomplete data. Gaps in data can distort the analysis, or require assumptions that may not be valid.

In order to ensure that primary data is complete, one should clearly indicate and document missing data, and where possible, implement strategies to fill gaps responsibly. For secondary data, one should determine if any data is missing. If there is a large amount of missing data, this may indicate that the dataset is not suitable. If there are minimal missing values, one should use reliable techniques to impute missing data without making any undue assumptions.
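A completeness check of this kind can be sketched in a few lines of Python. The snippet below is only a minimal illustration: the column names, the records, and the use of `None` to mark a missing answer are all made up for the example, and a real dataset may mark missing values differently.

```python
# Hypothetical survey records; None marks a missing answer (an assumption).
records = [
    {"cost": 3, "freshness": 4, "taste": 5},
    {"cost": 2, "freshness": None, "taste": 4},
    {"cost": None, "freshness": 3, "taste": 4},
    {"cost": 4, "freshness": 5, "taste": None},
    {"cost": 5, "freshness": None, "taste": 3},
]


def missing_fraction(rows, column):
    """Return the fraction of rows in which `column` is missing."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


# Report the share of missing values per column.
for column in ["cost", "freshness", "taste"]:
    print(column, missing_fraction(records, column))
```

A column with a large missing fraction is a signal to investigate why the values are missing before deciding whether to impute them or to look for a different dataset.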

As an example, missing data often arises in longitudinal health studies. These studies typically attempt to determine the status of patients over time. Missing values occur when patients do not show up for follow-up appointments and drop out of the study without explaining why. This can happen if they simply forget their follow-up appointments; if they feel better, and no longer feel it is necessary to visit the hospital; or if they move away; or for a host of other reasons. In such studies, it is therefore crucial to educate the participating patients on the need to attend their follow-up visits if they are able to.

A dataset is consistent if the data was recorded in a uniform and standardised manner across the dataset. Inconsistent data formats (e.g., different date formats or units) can complicate the analysis and increase the risk of errors.

For consistency of primary data, it is important to use standardised units of measurements (if applicable), standard questions with clear instructions on how to answer them (if applicable), standardised data formats, and data entry procedures. For secondary data, it is important to investigate whether the data formats and units are consistent across the dataset. If they are not consistent, this should be remedied before the data can be used.

Inconsistencies slip into datasets more easily than one might suppose. For example, if seven people are asked to write the date that classes started in 2025, they could each write it in a different way:

  1. 10/02/25
  2. 10 Feb 25
  3. 10th of Feb 2025
  4. 10/02/2025
  5. 10 February 2025
  6. 02/10/2025 (This one is strange, but it follows the date format used in the USA, MM/DD/YYYY. Sometimes, people’s phones or laptops might be set to the USA format by default, which can lead to these errors.)
  7. 2025/02/10 (This year-first format, YYYY/MM/DD, has the same ordering as the international standard ISO 8601, which writes dates as YYYY-MM-DD.)

If these answers were entered into an Excel spreadsheet, for example, Excel might not recognise all of them as dates, or might interpret the date in item 6 as the 2nd of October 2025. To ensure consistent dates, one could provide an example of a date (e.g. 31/12/2024) or specify a standard date format (e.g. DD/MM/YYYY, a common South African date format).
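The same difficulty appears when dates are cleaned with code. The Python sketch below tries a list of candidate formats on each of the seven answers. The list of formats, their order (day-first formats are tried first), and the helper name `parse_date` are assumptions chosen for this illustration, not a general-purpose date parser.

```python
from datetime import datetime

# The seven answers from the list above.
answers = [
    "10/02/25",
    "10 Feb 25",
    "10th of Feb 2025",
    "10/02/2025",
    "10 February 2025",
    "02/10/2025",
    "2025/02/10",
]

# Candidate formats, tried in order. Putting day-first formats first is an
# assumption that suits data collected in South Africa.
FORMATS = ["%d/%m/%y", "%d/%m/%Y", "%d %b %y", "%d %b %Y", "%d %B %Y", "%Y/%m/%d"]


def parse_date(text):
    """Return a date for `text`, or None if no candidate format matches."""
    # Only the wording "th of" is handled here; "1st of", "2nd of" are not.
    cleaned = text.replace("th of", "")
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


for answer in answers:
    print(answer, "->", parse_date(answer))
```

Six of the answers are standardised to 10 February 2025, but 02/10/2025 is read as 2 October 2025, because nothing in the text itself shows that the writer meant MM/DD/YYYY. Code can standardise formats, but it cannot recover the intended meaning of an ambiguous entry, which is why it is better to prescribe the format at the time of collection.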

Representativeness and unbiasedness

The data must accurately represent the population being studied. In most cases, it is impossible to capture data about the entire population of interest. In Example 1, for instance, Raheem cannot possibly send surveys to every single student on Hatfield campus. However, it is very important that he get data that represents the students on Hatfield campus. If he only targeted students who studied business sciences, they would likely frequent the food outlets on that part of the campus. This would not represent students who studied the humanities (as they are closer and thus more likely to visit food outlets in the Piazza) or students who studied engineering or science (as they are closer to food outlets around the Aula lawn).

Discuss: Consider Examples 2 and 3. In each of these examples, discuss whether it is possible to obtain data on the whole population under study, and what the factors are that limit how much of the population can be observed.

An unrepresentative dataset is in danger of being biased. Bias occurs when the data over-represents some members of the population, and under-represents others, or if it represents some members of the population in an unduly positive or negative light. In Raheem’s case, the worst that could happen is that he might make an unsound business decision. But, in the real world, bias can have extremely serious consequences, such as denying an applicant a bank loan based on their gender or race.

Bias can enter a dataset during collection, processing, or even during the interpretation of results. We will consider the following types of bias: selection, measurement, sampling, confirmation, and historical bias.

  1. Selection bias occurs when the collected data is not representative of the population being studied. In Raheem’s case, only handing out the surveys to students purchasing on-campus meals would exclude students who did not buy food on campus. This could exclude students who do not buy on-campus meals for financial, health, or other reasons, which could lead to a loss of valuable information for his business.

  2. Measurement bias happens when data collection tools or processes systematically record data incorrectly. In Example 3, William needs to measure the response of maize to the fungal infection using specific tools. If one of the tools were defective, this would introduce a systematic error into his dataset.

  3. Sampling bias arises when some members of a population are more likely to be included in the sample than others. For example, a poll on eating habits that only targeted shoppers at a butchery would not be likely to represent any vegetarians. This would undervalue vegetarians’ opinions and experiences.

  4. Confirmation bias arises when data collectors (or analysts) unintentionally focus on results that align with what they expect to see. In Example 3, William might unintentionally place more importance on results that show that his anti-fungal treatment works. In Example 2, Tebogo might expect more crime to occur around hubs of transport, such as the Gautrain station and bus stations, and unintentionally ignore crimes happening at other locations. In Example 1, Raheem might expect students to be dissatisfied with the cost of food on campus, and unintentionally pay less attention to the surveys of students who were satisfied.

Unlike some of the other biases, confirmation bias is something that all of us struggle with every day. Think about it: when you are watching sport, are you likely to think that the referee is being unfair to the team you expect to win, and ignore the penalties issued to the team you expect will lose? When you are planning an outdoor event, are you more likely to believe weather forecasts that predict the good weather you are hoping for? When you are reading reviews of skincare products, are you more likely to believe positive reviews on the brands you trust, and disregard negative ones? Are you more likely to trust the opinions of people who already have ideas that are similar to your own, as opposed to people who have different opinions? If confirmation bias leads us to make decisions that affect our health, financial decisions, or ideas about the world, it can affect us negatively.

  5. Historical bias occurs when past data is used that is inherently biased due to historical circumstances. This can lead to decisions that are not appropriate to the current day. In Example 2, if Tebogo obtained crime data from before the Hatfield CID was formed, this data would not enable her to make decisions about crime in 2025. In Example 1, if Raheem had data from before the COVID-19 pandemic, he would be misinformed about the food vendors available on Hatfield campus (following the COVID-19 pandemic, some food vendors closed and other new vendors opened businesses on campus). In the worst case, like confirmation bias, historical bias could be used to discriminate against people from certain demographic or religious groups. For example, historical data about approving bank loans or hiring employees in a company could discriminate based on race and gender, while past data about the adoption of children into stable homes could reflect historical stances on sexual orientation or single motherhood. Historical bias could potentially perpetuate inequality.
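The butchery poll used to illustrate sampling bias can be made concrete with a small Python sketch. The population below is entirely invented (100 people, 20 of whom are vegetarian and none of whom shop at the butchery), so the numbers only show the mechanism and say nothing about any real population.

```python
# Hypothetical population: each person has a diet and a flag for whether
# they shop at the butchery. All numbers are invented for illustration.
population = (
    [{"diet": "meat-eater", "shops_at_butchery": True}] * 60
    + [{"diet": "meat-eater", "shops_at_butchery": False}] * 20
    + [{"diet": "vegetarian", "shops_at_butchery": False}] * 20
)


def vegetarian_share(people):
    """Return the proportion of vegetarians among `people`."""
    return sum(p["diet"] == "vegetarian" for p in people) / len(people)


# A poll that only reaches shoppers at the butchery.
butchery_poll = [p for p in population if p["shops_at_butchery"]]

print(vegetarian_share(population))     # share in the whole population
print(vegetarian_share(butchery_poll))  # share among butchery shoppers
```

In this made-up population, one in five people is vegetarian, yet the poll at the butchery contains no vegetarians at all. No amount of careful analysis afterwards can recover opinions that were never collected.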

Discuss: What other examples of these different types of bias can you think of? How do you think they can enter a dataset? How would you avoid bias in data collection, and in daily life, to ensure informed choices?

Impact

The last aspect of evaluating data is that the data should not have a potentially harmful impact. Irrelevant or poor quality data could lead to incorrect, uninformed decisions, while unrepresentative and biased data could be directly harmful by perpetuating misinformation. Thus, ensuring that data is relevant, good quality, representative and unbiased goes a long way towards decreasing any potential harmful impact.

However, even relevant, good quality, representative data could be potentially harmful, depending on its nature. Sensitive data like public health data, crime data, or any data that is not adequately anonymised could be potentially harmful if distributed incorrectly. A good dataset could be misused by an unqualified or malicious user. Thus, it is important to ensure that sensitive data is stored securely and only shared with those who have the relevant access rights.

Examples

Example 4 (continuation of Example 1): Raheem has finished collecting surveys from students about their perceptions of the food available on campus. He inspects his dataset using the key aspects above, and comes to the following conclusions:

  1. Relevance: Since this was primary data, it was collected by the investigator to answer his specific research question. The data is up-to-date. It is thus relevant.
  2. Quality: The surveys that were given to students were standardised. All students received a copy of the same survey, with clear instructions on how to answer each question. Thus, the dataset is consistent. Furthermore, nearly all of the respondents answered all the questions. Thus, the dataset is complete.
  3. Representativeness and unbiasedness: Students from all across Hatfield campus were asked to fill in the survey. This included students from different years of study, different degrees and different faculties, as well as diverse demographic and socio-economic backgrounds. Thus, the data represents the diverse student body on Hatfield campus. Furthermore, surveys were handed out to students at a variety of spots on campus, including far away from any food vendors, and regardless of whether or not students were eating purchased food, home-made food, or not eating at all. Thus, there was little if any bias.
  4. Impact: The data should not have a potentially harmful impact. The data was anonymised, so that students’ answers on the survey could not be linked to their identities in any way. Any mention of specific restaurants or food outlets was also removed, so that no student’s opinion could be linked to any existing vendor on Hatfield campus. Thus, there is very little chance of any potentially harmful impact on either students or food vendors.

Example 5 (continuation of Example 2):

  1. Relevance: Since this was secondary data, it is important to consider its relevance. Since Tebogo obtained the data from all police precincts overlapping with the CID, and obtained it for the specific timeframes she wants to study, the data is relevant.
  2. Quality: There was some missing data, but the data is complete enough to be used. Data was entered mostly consistently. The data quality is adequate.
  3. Representativeness and unbiasedness: Crime data is, by nature, somewhat unrepresentative, since only reported crimes are part of the dataset. Thus, certain crimes are less likely to be represented adequately. This could include minor crimes, like the theft of inexpensive items, or serious crimes where the victim is afraid to come forward, such as domestic violence. Therefore, Tebogo must account for the possible unrepresentativeness in the data, and make use of additional techniques or data, such as underreporting estimates, in her analysis.
  4. Impact: Since the police removed all data that could identify individuals, the crime datasets cannot be used to harm any individual. Still, care must be taken that the data is not accessed by anyone except Tebogo and the other authorised people in her company.

Example 6 (continuation of Example 3):

  1. Relevance: This is primary data collected by the researcher for his specific purpose, thus it is relevant.
  2. Quality: William meticulously collected the data and ensured that it was complete and consistent.
  3. Representativeness and unbiasedness: The data was collected in a carefully climate-controlled and contaminant-free environment. This removed the effect of any potential confounding factors on the results.
  4. Impact: Data showing the effectiveness of a treatment can be sensitive before the treatment is subjected to further testing and approval by government authorities. Uninformed individuals might try to use a similar treatment on their crops, and if the treatment has not been conclusively tested and approved, this may lead to bad outcomes such as crops dying, or becoming unfit for consumption.

Class Exercise Question 3: Evaluate the dataset you collected in Class Exercise Question 1, using the key aspects of a good dataset described above.

Sampling

When we collect data, it is almost never possible to collect data on the entire population. For instance, if we want to study the habits of people who shop at Checkers, it will not be feasible to send out a survey to everyone in South Africa who has ever shopped at Checkers. When we collect data on a subset of the population, this is called a sample. In cases where we are able to collect data on the whole population, this is called a census. The table below highlights the differences between censuses and samples.

Census vs. sample

| | Census | Sample |
| --- | --- | --- |
| Definition | A complete enumeration of every individual in a population. | A subset of individuals selected from a population. |
| Coverage | Includes the entire population. | Includes only a portion of the population. |
| Time | Can be very time-consuming due to large-scale data collection. | Requires less time since data is collected on fewer individuals. |
| Cost | Usually quite expensive. | Less expensive. |
| Accuracy | Accurate if data is collected properly, but errors can still occur. | May have some sampling error*. |
| Feasibility | Difficult for large populations. | More practical, especially if the population is large. |

*Sampling error will be explained in a later section.

Although it is generally true that more data is better, there are many reasons to take a sample rather than a census. These include time and financial constraints, as well as feasibility. For example, when taking a geological survey, it is really not feasible to measure the soil at every location in an area! As long as the sample is unbiased and representative, samples can be very informative and helpful.

Sampling Frames

Before we start drawing samples, we must first define the concept of a sampling frame. This is a complete list of all individuals or units in the population of interest from which a sample is drawn. A sampling frame is the foundation for selecting a sample that is representative of the population under study. You can think of a sampling frame as the “pool” from which elements of the sample are drawn.

Example 7: Suppose a market researcher wants to study the shopping habits of students in Pretoria. The population he is interested in consists of all students at tertiary institutions in Pretoria. The sampling frame would be a complete list of all students currently registered at tertiary institutions in Pretoria.

Exercise: In each of the scenarios given below, describe the population and the sampling frame.

  1. An animal scientist wants to determine the average weight of male lions in the Kruger Park.
  2. A human resources professional wants to know the median salary earned by accountants in her company.
  3. A high school principal wants to identify the best learners in Grade 10 mathematics at his school in 2025.
  4. A forester wants to estimate the volume of merchantable timber of the pine trees on his plantation.

A good sampling frame must exhibit the following characteristics:

  1. Completeness: The sampling frame should include every member of the population of interest.
  2. Accuracy: The information in the sampling frame should be up-to-date and correct.
  3. No duplicates: Each member of the population should appear exactly once in the sampling frame.
  4. Relevance: The sampling frame should align with the research question and population of interest.
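Some of these characteristics can be checked mechanically. The Python sketch below compares a made-up sampling frame against a made-up population list and reports duplicates, members missing from the frame, and frame entries that do not belong to the population. The identifiers are hypothetical, and the sketch assumes that an independent list of the population is available, which is often not the case in practice.

```python
from collections import Counter

# Hypothetical population and sampling frame, identified by student number.
population_ids = {"u001", "u002", "u003", "u004", "u005"}
frame_ids = ["u001", "u002", "u002", "u004", "u999"]

# No duplicates: each unit should appear exactly once in the frame.
counts = Counter(frame_ids)
duplicates = sorted(i for i, c in counts.items() if c > 1)

# Completeness: every member of the population should be in the frame.
missing_from_frame = sorted(population_ids - set(frame_ids))

# Accuracy and relevance: the frame should not list units outside the population.
not_in_population = sorted(set(frame_ids) - population_ids)

print("Duplicates:", duplicates)
print("Missing from frame:", missing_from_frame)
print("Not in population:", not_in_population)
```

Here the frame lists one unit twice, omits two members of the population, and contains one entry that does not belong, so it would need to be corrected before a sample is drawn from it.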

Exercise: In each of the scenarios given below, evaluate the given sampling frame in terms of its completeness, accuracy, duplicates and relevance.

  1. A researcher wants to survey registered voters in a city about their voting preferences. They use a voter registration list from two years ago. However, many people have moved away, passed away, or changed their voter registration since then.
  2. A market researcher wants to study phone usage among adults in a city. They use a landline phone directory as their sampling frame. However, many people, especially young adults, rely exclusively on cellphones and are not listed in the directory.
  3. A company wants to survey its customers about satisfaction with their products. They use a customer database, but some customers appear multiple times due to different email addresses or accounts (e.g., one customer might have made purchases under two different email addresses).
  4. A researcher wants to study pet ownership habits in a city. They use a list of employees from an inner-city corporation as their sampling frame. However, this list only includes people who work at that company, who may not be representative of the broader population (e.g., they may live in flats or complexes that discourage owning pets).

Once a good sampling frame has been constructed, we can proceed with taking a sample. The next two sections will consider different sampling methods.

Important Notation: Note that we will be using \(N\) to denote the population size, and \(n\) to denote the sample size.

Probabilistic Sampling

In probabilistic sampling, every individual in the population has a known and non-zero chance of being selected.

Example 8 (continuation of Example 1): When Raheem collected data from students regarding their campus food preferences, he was collecting a sample, since he was not surveying every single student. There were many ways for him to go about collecting this sample. Here are some of the ways he considered:

  1. He could have asked his niece Aaliyah, who is studying philosophy, to hand out surveys to her classmates. In this case, only philosophy students who are in Aaliyah’s class would have a non-zero chance of being selected. Engineering students, for example, would have a zero chance of being selected. Thus, this would be a non-probabilistic sample.
  2. He could have asked one of his friends, John, who is a lecturer in accounting, to hand out surveys to his students. In this case, only accounting students in John’s class would have a non-zero chance of being selected. Philosophy students, for example, would have a zero chance of being selected. Thus, this would be a non-probabilistic sample.
  3. He could have asked other owners of campus restaurants and food outlets to hand out surveys to their customers. In this case, students who buy food from food outlets on campus would be selected. If Raheem got the owners of all of the outlets on campus to hand out surveys, then all students who buy food on campus would have a non-zero chance of being selected. This would be a probabilistic sample, but NOT of the population Raheem is interested in. Recall that he also wanted the opinions of students who do not buy food on campus. Those students would have zero chance of being selected.
  4. He could have liaised with university management to ensure surveys were sent out to all students via email. In this case, all students would have had a chance to answer the survey. This would be a probabilistic sample.
  5. He could have asked students to hand out surveys randomly to other students on campus. In this case, all students would, at least in theory, have had a chance to answer the survey. This would be a probabilistic sample.

Example 9: Thabang is a security manager at an airport. In order to reduce airport crime, he wants his staff to search travellers’ luggage. Since all travellers must pass through the security queues, and must also wait in the waiting area at their gate, he could search the luggage of everyone in the security queue, or everyone in the waiting area. However, this is not feasible, as it would take too much time and make people late for their flights. Thus, Thabang knows that he must take a sample of the travellers in the airport. He considers the following options:

  1. Select all travellers whose surnames begin with an A, an F or an N.
  2. Generate a sequence of non-repeating random numbers, e.g. 9, 24, 18, etc., and select travellers who are 9th, 24th, 18th, etc. in the security queue.
  3. Select travellers who are suspiciously in a hurry.
  4. Select travellers who have red suitcases.
  5. Select every 10th traveller in the security queue.
  6. Randomly select travellers from the waiting area at each airport gate.
  7. Randomly select waiting areas, and search the luggage of all travellers in that waiting area.

Exercise: Discuss each of Thabang’s proposed ways to sample travellers’ luggage, and comment on whether or not this option would constitute a probabilistic sample.

Image attribution: Designed by macrovector / Freepik

Simple Random Sampling

A simple random sample (SRS) is obtained if, at each draw, every element of the population that has not yet been included in the sample has an equal chance of being selected.

In Example 9, Option 2 is an example of a simple random sample. Here, every person in the security queue who has not yet been selected has an equal chance of being selected.

Suppose there are 100 people in the queue, i.e. the population size is \(N=100\). Before Thabang generates a random number, each person’s chance of being selected is \[\frac{1}{N}=\frac{1}{100}.\] Now suppose Thabang wants a sample of size \(n=10\). He generates the first random number, 9. The 9th traveller’s luggage is searched, and they are excluded from being searched again. Now, the chance of every other person in the queue being selected is \(\frac{1}{99}.\) Thabang now generates another random number (excluding the number 9), and obtains the number 24. The 24th traveller’s luggage is searched, and they are again excluded from future searches. The chance of every other person in the queue being selected (i.e. everyone except the 9th and 24th travellers) is now \(\frac{1}{98}.\)

This process is repeated until Thabang has sampled as many travellers as he decided on (e.g., 10 travellers).
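
As a quick check on this arithmetic, the draw-by-draw chances can be combined into the overall chance that a particular traveller ends up in the sample. The worked equation below is an addition to the example, and assumes that the \(n\) draws are made without replacement exactly as described above. The chance that a given traveller is never selected is
\[\frac{N-1}{N}\times\frac{N-2}{N-1}\times\cdots\times\frac{N-n}{N-n+1}=\frac{N-n}{N},\]
since all of the intermediate terms cancel. The chance that the traveller is selected at some point is therefore
\[1-\frac{N-n}{N}=\frac{n}{N}=\frac{10}{100}=0.1.\]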

The procedure to collect a simple random sample is as follows:

  1. Number all \(N\) elements in the population.
  2. Decide on a sample size \(n\).
  3. Select \(n\) random numbers out of the numbers belonging to the population elements.
  4. Select the population elements corresponding to these random numbers.
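
The four steps above can be sketched in a few lines of Python. This is only an illustration: the function name `simple_random_sample`, the traveller labels and the seed are hypothetical, and the standard library's `random.sample` is used to draw the distinct random numbers.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements from the population without replacement."""
    rng = random.Random(seed)  # a seed is optional; it only makes the draw repeatable
    N = len(population)
    # Steps 1 and 3: the elements are numbered 1..N, and n distinct numbers are drawn.
    numbers = rng.sample(range(1, N + 1), n)
    # Step 4: select the population elements corresponding to these numbers.
    return [population[i - 1] for i in numbers]

# Hypothetical security queue of N = 100 travellers
queue = [f"traveller_{i}" for i in range(1, 101)]
sample = simple_random_sample(queue, n=10, seed=1)
print(sample)
```

Because the numbers are drawn without replacement, no traveller can be selected twice, which matches the description of Thabang excluding travellers who have already been searched.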

The procedure to select random numbers is as follows:

  1. Select a random starting point from a table of random numbers.
  2. Read off consecutive digits in groups, where each group has as many digits as the population size \(N\) (e.g. groups of three digits if \(N=250\)). Write down each resulting number that lies between 1 and \(N\), skipping any number that has already been written down.
  3. Include the population elements with numbers that agree with these numbers.
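
The table-based procedure above can also be sketched in code. In this illustration, the digit string stands in for a row of a random number table read from the chosen starting point; the digits, the function name `numbers_from_digit_table` and the values \(N=50\) and \(n=5\) are all made up for the example. Skipping repeated numbers is an assumption, made so that no element is selected twice.

```python
def numbers_from_digit_table(digits, N, n):
    """Read groups of digits and keep the first n distinct numbers between 1 and N."""
    width = len(str(N))  # each group has as many digits as N
    selected = []
    for start in range(0, len(digits) - width + 1, width):
        value = int(digits[start:start + width])
        # Keep the number only if it labels a population element
        # and has not been written down already.
        if 1 <= value <= N and value not in selected:
            selected.append(value)
        if len(selected) == n:
            break
    return selected

# Hypothetical digits, read as 09 24 18 73 51 09 33 02 87 45
table_digits = "09241873510933028745"
print(numbers_from_digit_table(table_digits, N=50, n=5))  # [9, 24, 18, 33, 2]
```

Here 73 and 51 are discarded because they exceed \(N=50\), and the second 09 is discarded because element 9 has already been selected.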

Systematic Sampling

In a systematic sample, every \(k\)th element of the population is selected, after a random initial element is selected, where \(k=\frac{N}{n}\). Here, every element of the population has a \(\frac{n}{N}=\frac{1}{k}\) chance of being selected.

In the airport security example, Option 5 represents a systematic sample. Suppose there are now \(N=200\) travellers in the security queue, and that Thabang wants a sample of size \(n=20\). In order to take a systematic sample, he will first calculate \(k=\frac{N}{n}=\frac{200}{20}=10.\) He will then select a random number between 1 and \(k=10\), and select the corresponding traveller in the queue. Say the random number is 3. In this case, he will select the 3rd traveller. Thereafter, he will add \(k=10\) to this random number and select the corresponding traveller, i.e. the 13th traveller. He will repeat the process by selecting the 23rd, 33rd, etc. traveller, up to the 193rd traveller. He will then have his sample of size \(n=20\).

The procedure to collect a systematic sample is as follows:

  1. Number all \(N\) elements in the population.
  2. Decide on a sample size \(n\).
  3. Calculate the ratio \(k=\frac{N}{n}\), also called the sampling interval.
  4. Randomly select a number between 1 and \(k\) to determine the first individual in the sample.
  5. From this starting point, select every \(k\)th individual from the list.
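
The steps above can be sketched as follows. This is a minimal illustration that assumes \(N\) is an exact multiple of \(n\); the function name `systematic_sample` and its optional `start` argument (used here only to reproduce a particular random start) are hypothetical.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Return the positions (1..N) chosen by a systematic sample of size n."""
    k = N // n  # sampling interval; assumes N is an exact multiple of n
    if start is None:
        # Step 4: random starting point between 1 and k (inclusive)
        start = random.Random(seed).randint(1, k)
    # Step 5: every k-th position from the starting point onwards
    return list(range(start, N + 1, k))[:n]

positions = systematic_sample(N=200, n=20, start=3)
print(positions[:4], "...", positions[-1])  # [3, 13, 23, 33] ... 193
```

With \(N=200\), \(n=20\) and a random start of 3, the selected positions run from the 3rd to the 193rd traveller, giving 20 travellers in total.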

Stratified Sampling

In stratified sampling, the population is divided into subgroups (strata), and a random sample is taken from each subgroup (stratum). In the airport security example, Option 6 constitutes stratified sampling. Suppose there are \(3\) waiting areas in the airport. These waiting areas represent the strata. Suppose Area 1 has \(N_1=150\) travellers, Area 2 has \(N_2=100\) travellers, and Area 3 has \(N_3=50\) travellers currently waiting. Thus, the total population size is \(N=N_1+N_2+N_3=300\). If Thabang wants to take a sample of \(n=30\) travellers, he has two different ways to select the sample size per waiting area.

His first option is called proportional stratified sampling, and involves choosing a sample of travellers from each waiting area such that the sample size for each area is proportional to its size in the population. For each waiting area, the sample size can be calculated as \(n_h=\frac{N_h}{N}\times n, h=1,2,3\). Using this formula, he would select \(n_1=\frac{150}{300}\times 30=15\) travellers from Area 1, \(n_2=\frac{100}{300}\times 30=10\) travellers from Area 2, and \(n_3=\frac{50}{300}\times 30=5\) travellers from Area 3. Note that \(n_1+n_2+n_3=15+10+5=30=n\).

His second option is equal stratified sampling, where the same number of individuals is chosen from each stratum, regardless of its size. In this case, \(n_1=n_2=n_3=\frac{n}{3}=\frac{30}{3}=10.\) This kind of sampling is used when it is more important to select the same number of elements from each stratum than to ensure each stratum is represented in proportion to its size. In this example, it would lead to Area 1 being under-represented and Area 3 being over-represented in the sample, relative to their shares of the population.

The procedure to collect a stratified sample is as follows:

  1. Number all \(N\) elements in the population.
  2. Divide the population into mutually exclusive strata. Each individual should belong to one and only one stratum.
  3. Decide on a sample size \(n\).
  4. Decide whether to use proportional or equal stratified sampling, and consequently calculate the appropriate sample size per stratum.
  5. Select a random sample from each stratum using simple random sampling.
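
The two allocation rules and the final sampling step can be sketched as below. This is an illustration under stated assumptions: the function names, the traveller numbering and the seed are hypothetical, and the proportional rule simply rounds \(n_h=\frac{N_h}{N}\times n\) to the nearest whole number, which is only guaranteed to add up to \(n\) when the results are whole numbers, as they are here.

```python
import random

def proportional_allocation(stratum_sizes, n):
    """n_h = N_h / N * n, rounded to the nearest whole number."""
    N = sum(stratum_sizes.values())
    return {h: round(N_h * n / N) for h, N_h in stratum_sizes.items()}

def equal_allocation(stratum_sizes, n):
    """The same sample size for every stratum (assumes n divides evenly)."""
    return {h: n // len(stratum_sizes) for h in stratum_sizes}

def stratified_sample(strata, allocation, seed=None):
    """Take a simple random sample of the allocated size from each stratum."""
    rng = random.Random(seed)
    return {h: rng.sample(members, allocation[h]) for h, members in strata.items()}

# Hypothetical numbering of the 300 travellers by waiting area
strata = {
    "Area 1": list(range(1, 151)),    # N_1 = 150
    "Area 2": list(range(151, 251)),  # N_2 = 100
    "Area 3": list(range(251, 301)),  # N_3 = 50
}
sizes = {h: len(members) for h, members in strata.items()}

proportional = proportional_allocation(sizes, n=30)
equal = equal_allocation(sizes, n=30)
print(proportional)  # {'Area 1': 15, 'Area 2': 10, 'Area 3': 5}
print(equal)         # {'Area 1': 10, 'Area 2': 10, 'Area 3': 10}

sample = stratified_sample(strata, proportional, seed=42)
```

Under equal allocation, 10 of the 150 travellers in Area 1 are searched compared with 10 of the 50 travellers in Area 3, which is why the larger area ends up under-represented.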

Stratified sampling is useful when each stratum is homogeneous, i.e. elements within strata are similar, but there are big differences between strata.

Definition of Homogeneous Data: Homogeneous data consists of elements that are similar or even identical, exhibiting little variation.

Examples:

  • Demographics: All of the Grade 11 girls on the netball team at a school. These learners will have the same gender, the same sport, similar ages, weights and heights.
  • Environmental data: Measurements of the soil pH of one wetland. The soil pH will not vary so much within one wetland.
  • Medical data: All of the women between the ages of 20 and 30 in the maternity ward of a hospital in a high-income area. These women will be similar to each other in terms of income and how close they are to their due dates, and will be identical in gender.
  • Sales data: The sales records of stationery from the stationery shops in Pretoria. The sales records across all stationery shops will be fairly similar in terms of the products sold (pencils, paper, pens, notebooks, etc.) and the periods during which most sales are made (school supplies at the start of a new term, gifts and wrapping paper during the festive season).

Cluster Sampling

In cluster sampling, the population is divided into groups (clusters), similarly to stratified sampling. In stratified sampling, however, individuals are selected from every group, whereas in cluster sampling, entire groups are selected at random. In the airport security example, Option 7 is an example of a cluster sample. The waiting areas represent the clusters. To perform cluster sampling, Thabang would randomly select one or two of the waiting areas. Then, he could either perform one-stage cluster sampling, in which case he would select all of the individuals in each selected cluster, or two-stage cluster sampling, whereby he would sample random individuals from each selected cluster using simple random sampling. In one-stage cluster sampling, it may not be possible to select a precise sample size, since the size of the selected cluster(s) will determine the size of the sample. In two-stage cluster sampling, the sample size can be enforced more easily. For example, if he wanted a sample of size \(n=20\), and selected Areas 1 and 2, he could randomly select \(10\) individuals from Area 1 and \(10\) individuals from Area 2.

The procedure to collect a cluster sample is as follows:

  1. Number all \(N\) elements in the population.
  2. Divide the population into mutually exclusive clusters. Each individual should belong to one and only one cluster.
  3. Decide on the number of clusters to sample.
  4. Decide whether to use one-stage or two-stage cluster sampling. If two-stage cluster sampling is selected, decide on a sample size \(n\).
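  5. Randomly select the chosen number of clusters. In one-stage cluster sampling, include every individual in each selected cluster; in two-stage cluster sampling, select a random sample from each selected cluster using simple random sampling.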
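
A minimal sketch of one-stage and two-stage cluster sampling is given below. The function name `cluster_sample`, its `per_cluster` argument, the traveller numbering and the seeds are hypothetical choices made for the illustration.

```python
import random

def cluster_sample(clusters, num_clusters, per_cluster=None, seed=None):
    """Randomly select whole clusters, then all (one-stage) or some (two-stage) members."""
    rng = random.Random(seed)
    # Randomly select which clusters are included in the sample.
    chosen = rng.sample(sorted(clusters), num_clusters)
    sample = {}
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            # One-stage: every individual in the selected cluster is included.
            sample[name] = list(members)
        else:
            # Two-stage: a simple random sample is taken within the selected cluster.
            sample[name] = rng.sample(members, per_cluster)
    return sample

# Hypothetical numbering of the travellers in the three waiting areas
clusters = {
    "Area 1": list(range(1, 151)),
    "Area 2": list(range(151, 251)),
    "Area 3": list(range(251, 301)),
}

one_stage = cluster_sample(clusters, num_clusters=2, seed=0)
two_stage = cluster_sample(clusters, num_clusters=2, per_cluster=10, seed=0)
print({name: len(members) for name, members in one_stage.items()})
print({name: len(members) for name, members in two_stage.items()})
```

In the one-stage case the total sample size depends on which waiting areas happen to be drawn, whereas the two-stage case always yields \(2\times 10=20\) travellers.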

Cluster sampling is useful when each cluster is heterogeneous, i.e. elements within clusters are different from each other, but there are no big differences between clusters.

Definition of Heterogeneous Data: Heterogeneous data consists of elements that are substantially different from each other, exhibiting a considerable amount of variation.

Examples:

  • Demographics: All of the learners in a school, from Grade 1 to Grade 12. These learners will differ substantially from each other in terms of gender, age, height, weight and the sports they prefer.
  • Environmental data: Soil pH measured across an entire city that has clay-like, sandy and rocky soil. The soil pH will differ substantially based on where in the city each measurement was taken.
  • Medical data: All of the patients in the west wing of a hospital that includes maternity wards, oncology, and an emergency room. These patients will differ from each other in terms of their health conditions, age and gender.
  • Sales data: Sales records of grocery shops across countries in the northern and southern hemispheres. These sales records will differ vastly in terms of the kinds of food and supplies sold, as well as when which kind of food will be sold. For example, hearty, rich food will sell better in December in the northern hemisphere, and in July in the southern hemisphere; some countries will not sell pork or alcohol products at all, whereas those same products will be very popular in other countries; some countries will sell specific foods during certain festivals, etc.

Probabilistic Sampling Summary

No probabilistic sampling method is necessarily always better than another. It is important to select the appropriate sampling method based on the problem you are trying to solve, and the nature of the data. The table below summarises the characteristics of each probabilistic sampling method, and lists some of their advantages and disadvantages.

Probabilistic sampling summary
| Sampling Method | Description | Example | Advantages | Disadvantages |
| --- | --- | --- | --- | --- |
| Simple Random Sampling (SRS) | Every individual in the population has an equal chance to be selected. | Randomly selecting travellers in the security queue. | Selection bias is minimised; Easy to understand and implement | Difficult for large populations; Risk of underrepresenting some groups |
| Systematic Sampling | After a random start, every \(k\)th individual is selected. | Choosing every 10th traveller in the security queue. | Easier and quicker than SRS; Ensures even coverage of the population | May not be fully random if there is an underlying pattern in the data (e.g., if people are queueing such that every 10th person has a large suitcase, only people with large suitcases will be selected) |
| Stratified Sampling | The population is divided into strata, and a random sample is taken from each stratum. | Sampling a proportional number of travellers from each waiting area. | Ensures all groups are represented; Can be more reliable than SRS when strata are very different from each other | Needs a more in-depth understanding of the population to define suitable strata |
| Cluster Sampling | The population is divided into clusters, and entire clusters are randomly selected. In two-stage cluster sampling, samples are taken from the selected clusters. | Choosing waiting areas at random and then selecting all travellers in each selected area. | Practical and cost-effective compared to SRS and Systematic Sampling; Good to use when clusters are naturally occurring groups, e.g. different waiting areas, or schools, or companies | Clusters may not be representative; Naturally occurring clusters will not necessarily be internally heterogeneous but similar to other clusters |

Non-Probabilistic Sampling

Non-probabilistic sampling occurs when individuals are selected based on convenience or judgment. This means that some individuals have an unknown or zero chance of being chosen. This can introduce bias and a lack of representativeness. However, it can be useful in cases where we do not need a random or representative sample, or where it is infeasible to take a probabilistic sample because some individuals are difficult to access.

Here, we will consider three types of non-probabilistic sampling, namely Convenience Sampling, Judgment Sampling and Quota Sampling.

In convenience sampling, individuals are selected based on how easy they are to reach.

Example 10: Suppose an animal scientist is attempting to study lions in a nature park. Attempting to take a probabilistic sample would require knowledge of how many lions are in the park, which might not be known, since lions could die or be born without the scientist’s knowledge. She might instead study those lions who come to drink at a water hole that is accessible by Jeep. This would exclude lions who drink at rivers, or at water holes that are not accessible to her vehicle. However, it may not be possible for her to study lions she cannot access. In this case, convenience sampling would be appropriate, although there is no way of knowing the chance that each lion has to be selected, and some lions have zero chance of being selected.

In judgment sampling, individuals are selected based on experts’ decisions of which individuals would be most useful for the study.

Example 11: Suppose a conference organiser is tasked with assembling a panel to discuss banking in South Africa. Rather than taking a probabilistic sample of the CEOs and other top officers of South African banks, he might choose to invite only those whom he personally believes will make the most meaningful contribution to the discussion. Although this is a non-probabilistic way of sampling, it can be more targeted and effective in certain scenarios.

In quota sampling, individuals are selected to fulfil predetermined quotas for specific subgroups. This ensures that certain characteristics (e.g., age, gender, occupation) are proportionally represented. However, selection within those groups is not random.

Example 12: A survey of school learners might ensure that 20 learners from each grade answer the survey by asking a prefect to go into a classroom of each grade and hand out the survey to the first 20 learners in each class who want to fill it in. This is a quick and convenient way to ensure that each grade is adequately represented. However, since it is non-probabilistic, the study might still exhibit various kinds of sampling bias. We will talk about sampling bias in the next section.

Sampling Bias

Regardless of the sampling method we plan to use, it is always important to be aware of the dangers of sampling bias. Sampling bias occurs when some members of a population are systematically more likely to be selected in a sample than others, leading to a sample that is not representative of the entire population. This can distort the results of a study or analysis, making them unreliable or misleading.

Example 13: Caitlyn is the owner of a pet shop called Paws & Whiskers. She wants to gauge whether her customers are satisfied with the range of products stocked in her shop. If she only sends out surveys to customers who purchase dog food, this would systematically exclude all pet owners who have other pets, such as cats, reptiles, rabbits, mice, birds or fish.

There are six types of sampling bias that we will consider.

1. Selection Bias

This type of bias occurs when certain population groups are systematically excluded or underrepresented in the sample. The pet shop example above illustrates selection bias.

Q: What other kinds of selection bias could occur? Who would be excluded if Caitlyn only sent out surveys to customers who made online purchases, or customers with a loyalty card?

2. Voluntary Response Bias

Voluntary response bias takes place when the individuals who participate in a study are self-selected. Typically, this will lead to the inclusion of only those individuals who have strong opinions and want to be heard, and will exclude individuals with more moderate opinions.

If Caitlyn posts her survey on social media, for instance, without encouraging all of her customers to participate, most of the responses will be from customers who are very unhappy, and perhaps a few customers who are extremely happy with her products.

Think about it: how often do you rate deliveries, apps, or other consumer experiences? Most of us will skip the rating step unless we are either very dissatisfied, or extremely happy with the experience.

3. Survivorship Bias

This kind of bias takes place when only the “survivors” of a population are considered, and those who have dropped out or failed are ignored.

Suppose Caitlyn is studying other pet shops to find out what she could do to improve her business. If she only studies successful pet shops, she might conclude that stocking dog food is all she needs to do in order to remain successful. However, this would ignore all of the pet shops that have had to downscale or close, all of which also stocked dog food! Clearly, she would be in danger of drawing incorrect conclusions.

Another well-known example of survivorship bias occurred during World War II. American researchers were attempting to understand where bomber aircraft were most vulnerable, so that those areas could be reinforced to reduce the number of bombers being shot down. To do this, they initially studied damaged bombers that had returned to base to see where they had been hit. An example of such a bomber is shown in Figure 3, with the red dots representing bullet holes. However, they soon realised that this was an example of survivorship bias. The bullet holes in the bombers they were studying represented areas where bombers could be shot and still fly well enough to return to base. Bombers that had been hit in other places (like the engines) had been shot down over enemy territory, and did not return to base. Based on this, the scientists suggested reinforcing the areas that were undamaged on the bombers that had returned. The scientists’ understanding of survivorship bias thus helped to save many pilots’ lives.

Figure 3: Illustration of survivorship bias in World War II planes. Image from Wikipedia: https://en.wikipedia.org/wiki/Survivorship_bias#/media/File:Survivorship-bias.svg/2

4. Time Interval Bias

This bias occurs when the data collected are influenced by the time period during which the sample is collected.

In the pet shop example, time interval bias would occur if Caitlyn collected data on dog jacket sales during summer. She might conclude that dog jackets are not a popular item, when in fact they are very popular in cold weather.

5. Convenience Sampling Bias

As the name suggests, this bias goes hand-in-hand with convenience sampling. When samples are taken only from a group that is easily accessible, this may not represent the general population.

In the pet shop example, Caitlyn might pose questions on her products to customers who are browsing the shop and are not in a hurry. This would be convenient, as she would be talking to relaxed customers who were in a good mood. However, this would exclude all of the customers who were in a hurry, or those who were in a bad mood because they could not find the product they were looking for! In this way, she would not obtain a representative sample of her customers.

6. Non-Response Bias

This kind of bias occurs when there is a substantial difference between individuals who respond to a survey, and those who do not. The effect of non-response bias can be similar to that of voluntary response bias. The difference is that in voluntary response bias, individuals select themselves rather than being selected in a random way, so individuals without a strong opinion on the topic are typically excluded. In non-response bias, individuals are selected by the researcher (possibly at random), but a proportion of the selected individuals decline to respond.

In the pet shop example, non-response bias could occur if Caitlyn selected a random sample of customers, and then phoned them during work hours. Customers with busy jobs would be more likely to decline her call, whereas those with lower intensity jobs, or those who were not employed, would be more likely to answer her questions.

Sampling Bias Summary

In summary, sampling bias can lead to a whole host of errors causing a sample to be unrepresentative of the population. If a researcher assumes that an unrepresentative sample is in fact representative, they could draw seriously incorrect conclusions. Acting on these conclusions could be ineffective, or even harmful (the World War II plane example shows just how harmful this can be!).

It is therefore very important to understand and minimise sampling bias as much as possible. Properly designing a probabilistic sample can reduce most types of sampling bias. Additionally, non-response bias can be reduced by following up on those individuals who did not initially respond to the survey.

There are cases where non-probabilistic samples are acceptable for the purpose of the study at hand. However, the researcher must be aware of the fact that their sample does not necessarily reflect the population, and be careful when attempting to apply sample-based conclusions to the population. In the lion example, for instance, it might be acceptable for the researcher to study only those lions who drink at a water hole that is accessible by Jeep. However, she would have to acknowledge this as a limitation in her study, and be careful of applying her conclusions to all lions. The lions who drink at the water hole, for example, are able to drink enough water and do not suffer from dehydration. It would not be correct of her to assume that all of the lions in the park are properly hydrated, since there might be other lions at other locations in the park who do not have sufficient access to drinking water.

Sampling Bias versus Sampling Error

Finally, it is important to distinguish between sampling bias and sampling error. As explained previously, sampling bias occurs when individuals are excluded from the sample in some systematic way. This can be mitigated by improving the sampling design.

Sampling error, on the other hand, is a type of error that happens purely by chance. This error occurs because samples will almost never be perfectly representative of the population.

In the airport security example, Thabang could have a very well-designed sample, but could still miss a traveller who has a dangerous item in their luggage.

In the pet shop example, Caitlyn might conclude that 82% of her customers are satisfied with her products, based on a representative, probabilistic sample. However, the real number based on the population might be 80% or 85%.

The size of the sampling error can be estimated by using statistical techniques. For example, Caitlyn might be able to calculate that her result could be off by up to 5 percentage points in either direction. In that case, even if the sample indicates that 82% of her customers are satisfied with her products, she will know that the true number could plausibly be as low as 77% or as high as 87%.

Nearly all samples will exhibit some degree of sampling error. This can be mitigated by increasing the sample size.
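
One common way to put a number on the sampling error of a percentage is the approximate 95% margin of error for a sample proportion, \(1.96\sqrt{\hat{p}(1-\hat{p})/n}\). The sketch below is an addition to the example: it assumes a simple random sample and the usual normal approximation, and the sample sizes are hypothetical, since Caitlyn's sample size is not given.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion (z = 1.96 for about 95%)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.82  # 82% of sampled customers are satisfied
for n in (50, 227, 1000):  # hypothetical sample sizes
    moe = margin_of_error(p_hat, n)
    print(f"n = {n}: {p_hat - moe:.3f} to {p_hat + moe:.3f}")
```

Under these assumptions, a sample of roughly 227 customers gives a margin of about 5 percentage points, and the margin shrinks as the sample size grows, which is why increasing the sample size mitigates sampling error.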

Sampling bias versus sampling error
| Type of Error | Sampling Error | Sampling Bias |
| --- | --- | --- |
| Cause | Random chance | Systematic problem in the sampling method |
| Effect | Estimates will vary slightly | Systematic error in the results |
| How to Mitigate | Increase the sample size | Redesign the study |
| Randomness of Error | Random (cannot be avoided) | Systematic (can be avoided) |
| Severity | Not necessarily severe; will always occur | Very severe; can have harmful consequences unless the study is redone |
| Example | A survey finds that 82% of customers are happy when the true number is 84% | The World War II bomber example |